Introduction
Ever wondered how education shapes the lifestyle of an individual? You can be one degree away from a high paying job or in great student loan debt. I am curious to understand how the level of education and the stream/ major shapes the opportunities of an individual. In the analysis below I will tackle some general questions to lay the ground and dive into inference statistics to understand the price tag of college majors and if you should buy or pass.
The research question?
What major should a student planning to pursue a undergraduate degree in the United Sates of America choose, in order to make the most out of the investment in earnings.
Does the gender of a student have any influence on the major the student would pursue and the how do their earnings look.
Why do I care?
I remember being very confused on what major to pick post my high school. I went along with my interests and it has worked out fine for me so far. But I feel it would have been helpful to have a analysed report to make conscious decisions as these decisions pretty much pave the way for the career.Being a female student in STEM, I would like to how does the share of females look like for STEM and other majors.
Why should others care?
Pursuing a degree is a huge commitment and often involves hefty investment. Which could in most cases involve leaving the current job and taking student loan. Thus it is crucial to analyse what opportunities the degree could offer financially post completion.
DATA
Data source:
College majors data is from American Community Survey 2010-2012 Public Use Microdata Series. The data contains 5 files segregated based on level of education and age.
Since the College majors data consists of only population and median salary over different categories, I could not think of a suitable regression model that could be applied on the data. So Instead I gathered the data of estimated total number of students taking Science and Engineering Major over the years 2010 through 2019 from the same American Community Survey Public Use Microdata Series.
College majors data:
link to data:
majors-list.csv:
- List of majors with their FOD1P codes and major categories.
recent-grads.csv:
- File contains the basic earnings and labor force information and detailed breakdown, including by sex and by the type of job they got.
Science_and_Engineering_through_years:
science-engineering-2010-2019.csv:
- File contains the year and the estimated total number of students taking Science and Engineering Major.
Data collection:
The ACS PUMS files are a set of records from individual people or housing units, with disclosure protection enabled so that individuals or housing units cannot be identified. The data is from an observational study from 2010-2012 for “college majors” and from 2010-1019 for “Science_and_Engineering_through_years” data.
Units of observations:
majors-list description
| Header | Description |
|---|---|
| FOD1P | Recorded field of degree - first entry |
| Major_code | Major code, FO1DP in ACS PUMS |
| Major | Major description |
recent-grads description
| Header | Description |
|---|---|
| Rank | Rank by median earnings |
| Major_code | Major code, FO1DP in ACS PUMS |
| Major | Major description |
| Major_category | Category of major from Carnevale et al |
| Total | Total number of people with major |
| Sample_size | Sample size (unweighted) of full-time, year-round ONLY (used for earnings) |
| Men | Male graduates |
| Women | Female graduates |
| ShareWomen | Women as share of total |
| Employed | Number employed (ESR == 1 or 2) |
| Full_time | Employed 35 hours or more |
| Part_time | Employed less than 35 hours |
| Full_time_year_round | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35) |
| Unemployed | Number unemployed (ESR == 3) |
| Unemployment_rate | Unemployed / (Unemployed + Employed) |
| Median | Median earnings of full-time, year-round workers |
| P25th | 25th percentile of earnings |
| P75th | 75th percentile of earnings |
| College_jobs | Number with job requiring a college degree |
| Non_college_jobs | Number with job not requiring a college degree |
| Low_wage_jobs | Number in low-wage service jobs |
science-and-engineering-2010-2019 description
| Header | Description |
|---|---|
| year | Year of census estimate |
| Science_Engineering | Estimated total number of students taking Science and Engineering Major |
Variables:
The primary set of files that will be used are majors-list.csv, recent-grads.csv and science-and-engineering-2010-2019. All the variables in majors-list.csv, recent-grads.csv will be used in the study.
Type of study:
The data is from an observational study from year 2010 to 2012.
Data clean-up:
Load Packages:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.3 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(naniar)
Load data
majors_list <- read.csv("./majors-list.csv")
recent_grads <- read.csv("./recent-grads.csv")
se_data <- read.csv("./se_2010_2019.csv")
majors_list <- as_tibble(majors_list)
recent_grads <- as_tibble(recent_grads)
se_data <- as_tibble(se_data)
se_data <- se_data %>%
rename(
Science_Engineering_Total = Science_Engineering
)
Check for missing values
Lets check for any missing values and decide on how to handle them across data
vis_miss(majors_list)
The NA in majors list data represents education level below bachelors degree
vis_miss(recent_grads)
The only major category with partial missing data is FOOD SCIENCE of Agriculture & Natural Resources. It does not contain data about the total number of men, women and share of women.
vis_miss(se_data)
There is no missing data for both year and Science_Engineering_Total
Exploratory Data Analysis
Lets explore the recent grads data first followed by the grad_students(grad and Non grad data)
Recent Graduates:
1. What are the most popular 15 majors?
majors_sorted <- recent_grads %>%
arrange(desc(Median)) %>%
mutate(Major = str_to_title(Major),
Major = fct_reorder(Major, Median))
majors_sorted %>%
mutate(Major = fct_reorder(Major, Total)) %>%
arrange(desc(Total)) %>%
head(15) %>%
ggplot(aes(Major, Total, fill = Major_category)) +
geom_col() +
coord_flip() +
scale_y_continuous() +
labs(x = "",
y = "Total # of graduates")
Psychology seems to be the most popular Major among recent grads, followed by Business Management and Administration. Although category Psychology tops the list, 1/4 of the majors are from Business category.
2. What are the least popular 15 majors?
majors_sorted %>%
mutate(Major = fct_reorder(Major, Total)) %>%
arrange(desc(Total)) %>%
tail(15) %>%
ggplot(aes(Major, Total, fill = Major_category)) +
geom_col() +
coord_flip() +
scale_y_continuous() +
labs(x = "",
y = "Total # of graduates")
## Warning: Removed 1 rows containing missing values (position_stack).
Military Technologies is to be the least popular in Majors with less than 500 people taking it. Categories Engineering and Education together dominate the least popular majors.
Which major categories earn the most?
majors_sorted %>%
group_by(Major_category) %>%
summarize(Total = median(Total))%>%
mutate(Major_category = fct_reorder(Major_category, Total)) %>%
ggplot(aes(Major_category, Total)) +
geom_col() +
scale_y_continuous() +
coord_flip()
## Warning: Removed 1 rows containing missing values (position_stack).
majors_sorted %>%
mutate(Major_category = fct_reorder(Major_category, Median)) %>%
ggplot(aes(Major_category, Median, fill = Major_category)) +
geom_boxplot() +
scale_y_continuous(labels = dollar_format()) +
expand_limits(y = 0) +
coord_flip() +
theme(legend.position = "none")
Its interesting that the least popular majors earn the most while the most popular majors are down the scale. Business and Social Science are the only majors in the popular 5 and has highest median pay.
How is the gender distribution across popular majors?
majors_sorted %>%
arrange(desc(Total)) %>%
head(15) %>%
mutate(Major = fct_reorder(Major, Total)) %>%
gather(Gender, Number, Men, Women) %>%
ggplot(aes(Major, Number, fill = Gender)) +
geom_col() +
coord_flip()
How does the gender distribution correlate with earnings?
by_major_category <- majors_sorted %>%
filter(!is.na(Total)) %>%
group_by(Major_category) %>%
summarize(Men = sum(Men),
Women = sum(Women),
Total = sum(Total),
MedianSalary = median(Median)) %>%
mutate(ShareWomen = Women / Total) %>%
arrange(desc(ShareWomen))
library(ggrepel)
by_major_category %>%
mutate(Major_category = fct_lump(Major_category, 8)) %>%
ggplot(aes(ShareWomen, MedianSalary, color = Major_category)) +
geom_point() +
geom_smooth(aes(group = 1), method = "lm") +
geom_smooth(method = "lm") +
geom_text_repel(aes(label = Major_category)) +
expand_limits(y = 0)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
library(ggrepel)
majors_sorted %>%
ggplot(aes(ShareWomen, Median, color = Major_category)) +
geom_point() +
geom_smooth(aes(group = 1), method = "lm") +
expand_limits(y = 0)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Most STEM programs have least share of women and high pay while Health and Arts have most share of women. Business and Social Science seem to be right at the center with almost equal share of women and men.
How is the total estimated Science and Engineering over the years
plot(se_data$Year, se_data$Science_Engineering_Total , main="Science and Engineering Major Estimated Total over years 2010-2019",
xlab="year", ylab="Science and Engineering Major Estimated Total", pch=19)
Inference
indata <- majors_sorted %>%
filter(!is.na(Total)) %>%
group_by(Major_category) %>%
summarize(Men = sum(Men),
Women = sum(Women),
Total = sum(Total),
Employed = sum(Employed),
Full_time = sum(Full_time),
Part_time = sum(Part_time),
Full_time_year_round = sum(Full_time_year_round),
Unemployed = sum(Unemployed),
Unemployed_rate = sum(Unemployed)/sum(Unemployed) + sum(Employed),
College_jobs = sum(College_jobs),
Non_college_jobs = sum(Non_college_jobs),
Low_wage_jobs = sum(Low_wage_jobs),
MedianSalary = median(Median),
Emp_total = sum(Employed)+ sum(Unemployed)) %>%
mutate(ShareWomen = Women / Total) %>%
arrange(desc(Employed))
indata <- as_tibble(indata)
eng_tech <- filter(indata, Major_category =="Engineering" | Major_category =="Computers & Mathematics")
business <- filter(indata, Major_category == "Business")
arts <- filter (indata, Major_category == "Arts")
Two proportion Z-test:
- Applicability of the approach
- Formulation
- checks of assumptions
- validation of the approach(regression)
- Interpretation of the results
Applicability of the approach:
The data only contains the total of different categories and median salary from samples. As we do not have information on mean, we use the two proportion z-test.
Formulation:
Two sample Z-test (Upper tailed): Hypothesis: The employment rate of people who took TEM(Technology, Engineering and Mathematics) is less than or equal to those who took business. - H0 : pe \(\le\) pb - HA : pe > pb - let significance level \(\alpha\) = 0.05
Decision rule: Since \(\alpha\) = 0.05, and we are using z-test, the decision rule is to “Reject H0 if Z \(\ge\) 1.645
Checks of Assumption:
here, Let e denote: Engineering, Computer and Mathematics Major and, b denotes: Business
Since the value of z is given by \[z = \frac { p_e - p_b}{\sqrt{ p * (1- p) *(1/n_e + 1/n_b)}}\]
\[ p_e = \frac{Employed_e}{Employed_e + Unemployed_e} \implies p_e = 0.9317863\] \[ p_b = \frac{Employed_b}{Employed_b+ Unemployed_b} \implies p_b = 0.9316484\] \[ p = \frac{ Employed_e + Employed_b}{(Employed_e + Unemployed_e) +(Employed_b+ Unemployed_b)} \implies p = 0.9317003 \]
\[n_e = 706456, n_b = 1168619\]
Since, to perform a two proportions z test is valid only when sample size (n) is large enough. \(n_ep\), \(n_e(1-p)\), \(n_bp\), \(n_b(1-p)\) should be \(\ge\) 5. Here, as \(n_ep= 658205.3\) , \(n_e(1-p) = 48250.71\), \(n_bp = 1088803\) and \(n_b(1-p) = 79816.29\), we can proceed to perform z test on the data.
The employment rate of people who took TEM(Technology, Engineering and Mathematics) is less than or equal to those who took business.
Lets gather the required data into a tibble for ease of reference.
temp_vs_b <- eng_tech %>%
summarize(Men = sum(Men),
Women = sum(Women),
Major_category = "Engineering, Computers & Mathematics",
Total = sum(Total),
Employed = sum(Employed),
Full_time = sum(Full_time),
Part_time = sum(Part_time),
Full_time_year_round = sum(Full_time_year_round),
Unemployed = sum(Unemployed),
Unemployed_rate = sum(Unemployed)/sum(Unemployed) + sum(Employed),
College_jobs = sum(College_jobs),
Non_college_jobs = sum(Non_college_jobs),
Low_wage_jobs = sum(Low_wage_jobs),
MedianSalary = median(MedianSalary),
Emp_total = sum(Employed)+ sum(Unemployed)) %>%
mutate(ShareWomen = Women / Total) %>%
arrange(desc(Employed))
temp_vs_b <- temp_vs_b %>% add_row(business)
temp_vs_a <- temp_vs_b %>% add_row(arts)
Test for hypothesis, H0 : pe \(\le\) pb
res_temp_vs_b <- prop.test(x=c(temp_vs_b$Employed[1],temp_vs_b$Employed[2]), n=c(temp_vs_b$Emp_total[1],temp_vs_b$Emp_total[2]),alternative = "greater")
res_temp_vs_b
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(temp_vs_b$Employed[1], temp_vs_b$Employed[2]) out of c(temp_vs_b$Emp_total[1], temp_vs_b$Emp_total[2])
## X-squared = 0.12939, df = 1, p-value = 0.3595
## alternative hypothesis: greater
## 95 percent confidence interval:
## -0.0004884284 1.0000000000
## sample estimates:
## prop 1 prop 2
## 0.9317863 0.9316484
Interpretation of Result:
From the results of z test, we can see that the p-value of the hypothesis is equal to 0.35, the result is not significant. Thus we fail to reject the hypothesis that the employment rate of people who took TEM(Technology, Engineering and Mathematics) is less than or equal to those who took business.
Regression Analysis:
Application of Approach:
Fit an linear regression model over the time series se_data to predict the number of students taking Science and Engineering majors given the year #### Formulation Predict number of students taking Science and Engineering major given the year. #### Check for assumptions
plot(se_data$Year, se_data$Science_Engineering_Total , main="Science and Engineering Major Estimated Total over years 2010-2019",
xlab="year", ylab="Science and Engineering Major Estimated Total", pch=19)
par(mfrow=c(1, 2)) # divide graph area in 2 columns
boxplot(se_data$Year, main="year", sub=paste("Outlier rows: ", boxplot.stats(se_data$Year)$out)) # box plot for 'Year'
boxplot(se_data$Science_Engineering_Total, main="Science_Engineering_Total", sub=paste("Outlier rows: ", boxplot.stats(se_data$Science_Engineering_Total)$out)) # box plot for 'Science_Engineering_Total'
cor(se_data$Year, se_data$Science_Engineering_Total)
## [1] 0.995729
As the value is close to 1 there is a strong relationship between the variables
se_lr <- lm( Science_Engineering_Total ~ Year,data = se_data)
summary(se_lr)
##
## Call:
## lm(formula = Science_Engineering_Total ~ Year, data = se_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -230108 -189820 3226 151470 358029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.460e+09 4.861e+07 -30.03 1.64e-09 ***
## Year 7.360e+05 2.413e+04 30.50 1.45e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219200 on 8 degrees of freedom
## Multiple R-squared: 0.9915, Adjusted R-squared: 0.9904
## F-statistic: 930.5 on 1 and 8 DF, p-value: 1.448e-09
The residuals look close to symmetrical. The least square estimates of the fitted line has the intercept -1.460e+09 and a slope 7.360e+05 and the p-value of Year is statistically significant as 1.45e-09 is less than 0.05. A significant p-value for the year means that it will give us a reliable guess of the number of people taking Science Engineering Major.
As the R-squared is 0.9915, it means that year can explain 99% of variation in Science_Engineering_total. The Adjusted R-squared is the R-squared scaled by the number of parameters in the model.The Degrees of freedom of the model is 8.
plot(Science_Engineering_Total ~ Year,data=se_data)
abline(se_lr, col="blue")
plot(se_lr)
testdata<- data.frame(Year=2020)
predict(se_lr,testdata)
## 1
## 26923972
AIC(se_lr)
## [1] 278.0983
BIC(se_lr)
## [1] 279.006
Validation of approach
Interpretation of the result
The model is captures the relationship among the variables well.
Conclusion
Disclaimer: The following analysis is performed as a final project for my course and the data is from American Community Survey 2010-2012. There is a high possibility the results might be outdated given the uncertainty of the events around the world.Geography of a place can largely affect the compensation and opportunities.